Automatic Extraction of Logical Web Lists

نویسندگان

  • Pasqua Fabiana Lanotte
  • Fabio Fumarola
  • Michelangelo Ceci
  • Andrea Scarpino
  • Michele Damiano Torelli
  • Donato Malerba
چکیده

Recently, there has been increased interest in the extraction of structured data from the web (both “Surface” Web and“Hidden” Web). In particular, in this paper we focus on the automatic extraction of Web Lists. Although this task has been studied extensively, existing approaches are based on the assumption that lists are wholly contained in a Web page.They do not consider that many websites span their listing on several Web Pages and show for each of these only a partial view. Similar to databases, where a view can represent a subset of the data contained in a table, they split a logical list in multiple views (view lists). Automatic extraction of logical lists is an open problem. To tackle this issue we propose an unsupervised and domain-independent algorithm for logical list extraction. Experimental results on real-life and data-intensive Web sites confirm the effectiveness of our approach.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

Entity Extraction from Unstructured Data on the Web

A large number of web pages contain information about entities in lists where the lists are represented in textual form. Textual lists contain implicit records of entities. However, the field values of such records cannot easily be separated or extracted by automatic processes. This, therefore, remains a challenging research problem in the literature. Previous studies in the literature relied m...

متن کامل

Information Extraction in Semantic Wikis

This paper deals with information extraction technologies supporting semantic annotation and logical organization of textual content in semantic wikis. We describe our work in the context of the KiWi project which aims at developing a new knowledge management system motivated by the wiki way of collaborative content creation that is enhanced by the semantic web technology. The specific characte...

متن کامل

An Integrated Approach for Automatic Semantic Structure Extraction in Document Images

In this paper we present an integrated approach for semantic structure extraction in document images. Document images are initially processed to extract both their layout and logical structures on the base of geometrical and spatial information. Then, textual content of logical components is employed for automatic semantic labeling of layout structures. To support the whole process different ma...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014